ABSTRACT

Distributed frameworks are gaining increasingly widespread 
use in applications that process large amounts of data. One 
important example application is large scale similarity search, 
for which Locality Sensitive Hashing (LSH) has emerged 
as the method of choice, especially when the data is high-dimensional.
At its core, LSH is based on hashing the data
points to a number of buckets such that similar points are 
more likely to map to the same buckets. To guarantee high 
search quality, the LSH scheme needs a rather large num- 
ber of hash tables. This entails a large space requirement, 
and in the distributed setting, with each query requiring 
a network call per hash bucket look up, this also entails a 
big network load. The Entropy LSH scheme proposed by 
Panigrahy significantly reduces the number of required hash 
tables by looking up a number of query offsets in addition 
to the query itself. While this improves the LSH space re- 
quirement, it does not help with (and in fact worsens) the 
search network efficiency, as now each query offset requires a 
network call. In this paper, focusing on the Euclidean space
under the $\ell_2$ norm and building on Entropy LSH, we propose
the distributed Layered LSH scheme, and prove that it ex- 
ponentially decreases the network cost, while maintaining a 
good load balance between different machines. Our exper- 
iments also verify that our scheme results in a significant 
network traffic reduction that brings about large runtime 
improvement in real world applications. 

1. INTRODUCTION 

Similarity search is the problem of retrieving data objects 
similar to a query object. It has become an important com- 
ponent of modern data-mining systems, with applications 
ranging from de-duplication of web documents, content-based 
audio, video, and image search [24, 27, 11], collaborative
filtering [13], large scale genomic sequence alignment [5],
natural language processing [30], pattern classification [12], and
clustering [8].

In these applications, objects are usually represented by a 
high dimensional feature vector. A scheme to solve the sim- 
ilarity search problem constructs an index which, given a 
query point, allows for quickly finding the data points similar
to it. In addition to the query search procedure, the index
construction also needs to be time and space efficient. Fur- 
thermore, since today's massive datasets are typically stored 
and processed in a distributed fashion, where network com- 
munication is one of the most important bottlenecks, these 
methods need to be network efficient, as otherwise, the net- 
work load would slow down the whole scheme. 

An important family of similarity search methods is based 
on the notion of Locality Sensitive Hashing (LSH) [21]. At
its core, LSH is based on hashing the (data and query) points 
into a number of hash buckets such that similar points have 
higher chances of getting mapped to the same buckets. Then 
for each query, the nearest neighbor among the data points 
mapped to the same bucket as the query point is returned as
the search result. 

LSH has been shown to scale well with the data dimension 
[21, 25]. However, the main drawback of conventional LSH
based schemes is that to guarantee a good search quality, 
one needs a large number of hash tables. This entails a 
rather large space requirement for the index, and also in the 
distributed setting, a large network load, as each hash bucket 
look up requires a communication over the network. To 
mitigate the space efficiency issue, Panigrahy [25] proposed
the Entropy LSH scheme, which significantly reduces the 
number of required hash tables, by looking up a number of 
query offsets in addition to the query itself. Even though 
this scheme improves the LSH space efficiency, it does not 
help with its network efficiency, as now each query offset 
lookup requires a network call. In fact, since the number of 
required offsets in Entropy LSH is larger than the number 
of required hash tables in conventional LSH, Entropy LSH 
amplifies the network inefficiency issue. 

In this paper, focusing on the Euclidean space under the $\ell_2$ norm
and building on the Entropy LSH scheme, we design the
Layered LSH method for distributing the hash buckets over 
a set of machines which leads to a very high network effi- 
ciency. We prove that, compared to a straightforward dis- 
tributed implementation of LSH or Entropy LSH, our Lay- 
ered LSH method results in an exponential improvement in 
the network load (from polynomial in n, the number of data 
points, to sub-logarithmic in n), while maintaining a good 
load balance between the different machines. Our experi- 
ments also verify that our scheme results in large network 
traffic improvement that in turn results in significant run- 
time speedups. 



In the rest of this section, we first provide some background 
on the similarity search problem and the relevant methods, 
then discuss LSH in the distributed computation model, and 
finally present an overview of our scheme as well as our re- 
sults. 

1.1 Background 

In this section, we briefly review the similarity search prob- 
lem, the basic LSH and Entropy LSH approaches to solving 
it, the distributed computation framework and its instanti- 
ations such as MapReduce and Active DHT, and a straight- 
forward implementation of LSH in the distributed setting as 
well as its major drawback. 

Similarity Search: The similarity search problem is that 
of finding data objects similar to a query object. In many 
practical applications, the objects are represented by multi- 
dimensional feature vectors, and hence the problem reduces 
to finding objects close to the query object under the fea- 
ture space distance metric. The goal in all these problems is 
to construct an index, which given the query point, allows 
for quickly finding the search results. The index construc- 
tion and the query search both need to be space, time, and 
network efficient. 

Basic LSH: A method to solve the similarity search prob- 
lem over high dimensional large datasets is based on a spe- 
cific type of hash functions, namely Locality Sensitive Hash 
(LSH) functions, proposed by Indyk and Motwani [21]. An
LSH function maps the points in the feature space to a num- 
ber of buckets in a way that similar points map to the same 
buckets with a high chance. Then, a similarity search query 
can be answered by first hashing the query point and then 
finding the close data points in the same bucket as the one 
the query is mapped to. To guarantee both a good search 
quality and a good search efficiency, one needs to use multi- 
ple LSH functions and combine their results. Then, although 
this approach yields a significant improvement in the run- 
ning time over both the brute force linear scan and the space 
partitioning approaches [18, 5, 7, 23, 19, 22], unfortunately
the required number of hash functions is usually large [5, 18],
and since each hash table has the same size as the dataset, 
for large scale applications, this entails a very large space 
requirement for the index. Also, in the distributed setting, 
since each hash table lookup at query time corresponds to a 
network call, this entails a large network load which is also 
undesirable. 

Entropy LSH: To mitigate the space inefficiency of LSH, 
Panigrahy [29] introduced the Entropy LSH scheme. This
scheme uses the same hash functions and indexing method 
as the basic LSH scheme. However, it uses a different query 
time procedure: In addition to hashing the query point, it 
hashes a number of query offsets as well and also looks up 
the hash buckets that any of these offsets map to. The idea 
is that the close data points are very likely to be mapped 
to either the same bucket as the query point or to the same 
bucket as one of the query offsets. This significantly reduces 
the number of hash tables required to guarantee the search 
quality and efficiency. Hence, this scheme significantly im- 
proves the index space requirement compared to the basic 
LSH method. However, it unfortunately does not help with 
the query network efficiency, as each query offset requires a
network call. Indeed, since the number of query offsets required
by Entropy LSH is larger than the number of hash tables
required by basic LSH [21, 29, 27], the
query network efficiency of Entropy LSH is even worse than 
that of the basic LSH. 

In this paper, we focus on the network efficiency of LSH 
in distributed frameworks. Two main instantiations of such 
frameworks are the batched processing system MapReduce 
[16] (with its open source implementation Apache Hadoop
[1]), and the real-time processing system known as Active
Distributed Hash Table (Active DHT), such as Twitter
Storm [3]. The common feature in all these systems is
that they process data in the form of (Key, Value) pairs, 
distributed over a set of machines. This distributed (Key, 
Value) abstraction is all we need for both our scheme and 
analyses to apply. However, to make the later discussions 
more concrete, here we briefly overview the mentioned dis- 
tributed systems. 

MapReduce: MapReduce [16] is a simple model for batched 
distributed processing using a number of commodity ma- 
chines, where computations are done in three phases. The 
Map phase reads a collection of (Key, Value) pairs from an 
input source, and by invoking a user defined Mapper func- 
tion on each input element independently and in parallel, 
emits zero or more (Key, Value) pairs associated with that 
input element. The Shuffle phase then groups together all 
the Mapper-emitted (Key, Value) pairs sharing the same 
Key, and outputs each distinct group to the next phase. 
The Reduce phase invokes a user-defined Reducer function 
on each distinct group, independently and in parallel, and 
emits zero or more values to associate with the group's Key. 
The emitted (Key, Value) pairs can then be written on the 
disk or be the input of the Map phase in a following itera- 
tion. 
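
The three phases described above can be sketched with a small in-memory simulation. This is only an illustration of the (Key, Value) data flow, not of any real MapReduce API: the function `map_reduce` and the word-count Mapper and Reducer are hypothetical names introduced here, and everything runs on a single machine.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one Map -> Shuffle -> Reduce round in memory (illustrative only)."""
    # Map: every input pair independently emits zero or more (key, value) pairs.
    emitted = []
    for key, value in records:
        emitted.extend(mapper(key, value))
    # Shuffle: group all emitted values that share the same key.
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    # Reduce: every group independently emits zero or more values for its key.
    output = []
    for key, values in groups.items():
        for result in reducer(key, values):
            output.append((key, result))
    return output

def word_mapper(_doc_id, text):
    # Hypothetical Mapper: one (word, 1) pair per word occurrence.
    return [(word, 1) for word in text.split()]

def count_reducer(_word, ones):
    # Hypothetical Reducer: total count for the group's key.
    return [sum(ones)]

counts = dict(map_reduce([(1, "a b a"), (2, "b c")], word_mapper, count_reducer))
```

In a real deployment the Shuffle step is where pairs cross the network, which is why the shuffle size is the natural measure of network traffic for MapReduce.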

Active DHT: A DHT (Distributed Hash Table) is a dis- 
tributed (Key, Value) store which allows Lookups, Inserts, 
and Deletes on the basis of the Key. The term Active refers 
to the fact that an arbitrary User Defined Function (UDF) 
can be executed on a (Key, Value) pair in addition to In- 
sert, Delete, and Lookup. Twitter's Storm [3] is an example 
of Active DHT that is gaining widespread use. The Active 
DHT model is broad enough to act as a distributed stream 
processing system and as a continuous version of MapRe- 
duce [26] . All the (Key, Value) pairs in a node of the active 
DHT are usually stored in main memory to allow for fast 
real-time processing of data and queries. 
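
As a rough illustration of the Active DHT abstraction, the following sketch keeps one in-memory dictionary per simulated machine and counts every operation as one network call. The class `ActiveDHT` and its method names are assumptions made for this example; they do not correspond to the API of Storm or of any other real system.

```python
from collections import defaultdict

class ActiveDHT:
    """Toy single-process model of an Active DHT (illustrative only)."""

    def __init__(self, num_machines):
        # One in-memory (Key -> list of Values) store per simulated machine.
        self.machines = [defaultdict(list) for _ in range(num_machines)]
        self.network_calls = 0

    def _machine_for(self, key):
        # A hash of the Key implicitly selects the responsible machine.
        return self.machines[hash(key) % len(self.machines)]

    def insert(self, key, value):
        self.network_calls += 1
        self._machine_for(key)[key].append(value)

    def lookup(self, key):
        self.network_calls += 1
        return list(self._machine_for(key).get(key, []))

    def delete(self, key):
        self.network_calls += 1
        self._machine_for(key).pop(key, None)

    def run_udf(self, key, udf, argument):
        # "Active": run a user defined function next to the stored values.
        self.network_calls += 1
        return udf(self._machine_for(key).get(key, []), argument)

dht = ActiveDHT(num_machines=4)
dht.insert(7, "p1")
dht.insert(7, "p2")
matches = dht.run_udf(7, lambda values, suffix: [v for v in values if v.endswith(suffix)], "2")
```

The `network_calls` counter mirrors the network cost measure used for Active DHT in the discussion that follows.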

In addition to the typical performance measures of total 
running time and total space, two other measures are very 
important for both MapReduce and Active DHTs. First, 
the total network traffic generated, that is the shuffle size 
for MapReduce and the number of network calls for Active 
DHT, and second, the maximum number of values with the 
same key; a high value here can lead to the "curse of the 
last reducer" in MapReduce [31] or to one compute node 
becoming a bottleneck in Active DHT. 

Next, we will briefly discuss a simple implementation of LSH 
in distributed frameworks. 



A Simple Distributed LSH Implementation: 

Each hash table associates a (Key, Value) pair to each data 
point, where the Key is the point's hash bucket, and the 
Value is the point itself. These (Key, Value) pairs are ran- 
domly distributed over the set of machines such that all the 
pairs with the same Key are on the same machine. This is 
done implicitly using a random hash function of the Key. 
For each query, first a number of (Key, Value) pairs corre- 
sponding to the query point are generated. The Value in 
all of these pairs is the query point itself. For basic LSH, 
per hash table, the Key is the hash bucket the query maps 
to, and for Entropy LSH, per query offset, the Key is the 
hash bucket the offset maps to. Then, each of these (Key, 
Value) pairs gets sent to and processed by the machine re- 
sponsible for its Key. This machine contains all data points 
mapping to the same query or offset hash bucket. Then, it 
can perform a search within the data points which also map 
to the same Key and report the close points. This search 
can be done using the UDF in Active DHT or the Reducer 
in MapReduce. 

In the above implementation, the amount of network com- 
munication per query is directly proportional to the number 
of hash buckets that need to be checked. However, as men- 
tioned earlier, this number is large for both basic LSH and 
Entropy LSH. Hence, in large scale applications, where ei- 
ther there is a huge batch of queries or the queries arrive in 
real-time at very high rates, this will require a lot of commu- 
nication, which not only depletes the valuable network re- 
sources in a shared environment, but also significantly slows 
down the query search process. In this paper, we propose 
an alternative way, called Layered LSH, to implement the 
Entropy LSH scheme in a distributed framework and prove 
that it exponentially reduces the network load compared to 
the above implementation, while maintaining a good load 
balance between different machines. 

1.2 Overview of Our Scheme 

At its core, Layered LSH is a carefully designed implementa- 
tion of Entropy LSH in the distributed (Key, Value) model. 
The main idea is to distribute the hash buckets such that 
near points are likely to be on the same machine (hence net- 
work efficiency) while far points are likely to be on different 
machines (hence load balance). 

This is achieved by rehashing the buckets to which the data
points and the offsets of query points map, via an additional
layer of LSH, and then using the hashed buckets as
Keys. More specifically, each data point is associated with 
a (Key, Value) pair where Key is the mapped value of LSH 
bucket containing the point, and Value is the point's hash 
bucket concatenated with the point itself. Also, each query 
point is associated with multiple (Key, Value) pairs where 
Value is the query itself and Keys are the mapped values of 
the buckets which need to be searched in order to answer 
this query. 

Use of an LSH to rehash the buckets not only allows using 
the proximity of query offsets to bound the number of (Key, 
Value) pairs for each query (thus guaranteeing network ef- 
ficiency), but also ensures that far points are unlikely to 
be hashed to the same machine (thus maintaining load bal- 
ance). 



1.3 Our Results 

Here, we present a summary of our results in this paper: 

1. We prove that Layered LSH incurs only $O(\sqrt{\log n})$ network
cost per query. This is an exponential improvement
over the $O(n^{\Theta(1)})$ query network cost of the simple
distributed implementation of both Entropy LSH
and basic LSH.

2. Surprisingly, we prove that the network efficiency of
Layered LSH is independent of the search quality. This 
is in sharp contrast with both Entropy LSH and ba- 
sic LSH in which increasing search quality directly in- 
creases the network cost. This offers a very large im- 
provement in both network efficiency and hence over- 
all run time in settings which require similarity search 
with high accuracy. We also present experiments which 
verify this observation on the MapReduce framework. 

3. We prove that despite network efficiency (which re- 
quires collocating near points on the same machines), 
Layered LSH sends points which are only $\Omega(1)$ apart
to different machines with high likelihood. This shows 
Layered LSH hits the right tradeoff between network 
efficiency and load balance across machines. 

4. We present experimental results with Layered LSH on 
Hadoop, which show it also works very well in practice. 

The organization of this paper is as follows. In Section 2, we
study the Basic and Entropy LSH indexing methods. In Section 3,
we give the detailed description of Layered LSH, including
its pseudocode for the MapReduce and Active DHT
frameworks, and also provide the theoretical analysis of its
network cost and load balance. We present the results of
our experiments on Hadoop in Section 4, study the related
work in Section 5, and conclude in Section 6.

2. PRELIMINARIES 

In Section 1.1, we provided the high-level background needed
for this paper. Here, we present the necessary preliminaries 
in further detail. Specifically, we formally define the similar- 
ity search problem, the notion of LSH functions, the basic 
LSH indexing, and Entropy LSH indexing. 

Similarity Search: As mentioned in Section 1.1, similarity
search in a metric space with domain $T$ reduces to the problem
more commonly known as the $(c, r)$-NN problem, where
given an approximation ratio $c > 1$, the goal is to construct
an index that, given any query point $q \in T$ within distance $r$
of a data point, allows for quickly finding a data point $p \in T$
whose distance to $q$ is at most $cr$.

Basic LSH: To solve the (c, r)-NN problem, Indyk and 
Motwani [21] introduced the following notion of LSH func- 
tions: 



Definition 1. For the space $T$ with metric $\zeta$, given distance
threshold $r$, approximation ratio $c > 1$, and probabilities
$p_1 > p_2$, a family of hash functions $\mathcal{H} = \{h : T \to U\}$
is said to be an $(r, cr, p_1, p_2)$-LSH family if for all $x, y \in T$,

$$\text{if } \zeta(x, y) \le r \text{ then } \Pr_{\mathcal{H}}[h(x) = h(y)] \ge p_1,$$
$$\text{if } \zeta(x, y) \ge cr \text{ then } \Pr_{\mathcal{H}}[h(x) = h(y)] \le p_2.$$

Hash functions drawn from $\mathcal{H}$ have the property that near
points (with distance at most $r$) have a high likelihood (at
least $p_1$) of being hashed to the same value, while far away
points (with distance at least $cr$) are less likely (probability
at most $p_2$) to be hashed to the same value; hence the name
locality sensitive.

LSH families can be used to design an index for the $(c, r)$-NN
problem as follows. First, for an integer $k$, let
$\mathcal{H}' = \{H : T \to U^k\}$ be a family of hash functions in which any
$H \in \mathcal{H}'$ is the concatenation of $k$ functions in $\mathcal{H}$, i.e.,
$H = (h_1, h_2, \ldots, h_k)$, where $h_i \in \mathcal{H}$ ($1 \le i \le k$). Then, for an
integer $M$, draw $M$ hash functions from $\mathcal{H}'$, independently
and uniformly at random, and use them to construct the
index consisting of $M$ hash tables on the data points. With
this index, given a query $q$, the similarity search is done
by first generating the set of all data points mapping to the
same bucket as $q$ in at least one hash table, and then finding
the closest point to $q$ among those data points. The idea is
that a function drawn from $\mathcal{H}'$ has a very small chance ($p_2^k$)
to map far away points to the same bucket (hence search
efficiency), but since it also makes it less likely ($p_1^k$) for a
near point to map to the same bucket, we use a number, $M$,
of hash tables to guarantee retrieving the near points with
a good chance (hence search quality).

To utilize this indexing scheme, one needs an LSH family $\mathcal{H}$
to start with. Such families are known for a variety of metric
spaces, including the Hamming distance, the Earth Mover
Distance, and the Jaccard measure [10]. Furthermore, Datar
et al. [15] proposed LSH families for $\ell_p$ norms, with $0 < p \le 2$,
using $p$-stable distributions. For any $W > 0$, they consider
a family of hash functions $\mathcal{H}_W = \{h_{a,b} : \mathbb{R}^d \to \mathbb{Z}\}$ such that

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{W} \right\rfloor$$

where $a \in \mathbb{R}^d$ is a $d$-dimensional vector each of whose entries
is chosen independently from a $p$-stable distribution,
and $b \in \mathbb{R}$ is chosen uniformly from $[0, W]$. Further improvements
have been obtained in various special settings
[4]. In this paper, we will focus on the most widely used
$p$-stable distribution, i.e., the 2-stable, Gaussian distribution.
For this case, Indyk and Motwani [21] proved the following
theorem:

Theorem 2. With $n$ data points, choosing $k = O(\log n)$
and $M = O(n^{1/c})$, the LSH indexing scheme above solves
the $(c, r)$-NN problem with constant probability.
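
The index construction and query procedure described above can be sketched as follows. This is a minimal single-machine illustration, assuming Gaussian (2-stable) hash functions; the names `make_hash` and `BasicLSHIndex`, as well as the parameter values in the usage lines, are hypothetical and chosen only for the example.

```python
import math
import random
from collections import defaultdict

def make_hash(d, k, W, rng):
    """Sample H = (h_1, ..., h_k) with h_i(v) = floor((a_i . v + b_i) / W)."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    b = [rng.uniform(0.0, W) for _ in range(k)]

    def H(v):
        return tuple(
            math.floor((sum(x * y for x, y in zip(a_i, v)) + b_i) / W)
            for a_i, b_i in zip(a, b)
        )
    return H

class BasicLSHIndex:
    """M hash tables, each keyed by a concatenation of k hash values."""

    def __init__(self, points, k, M, W, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        self.points = points
        self.hashes = [make_hash(d, k, W, rng) for _ in range(M)]
        self.tables = [defaultdict(list) for _ in range(M)]
        for idx, p in enumerate(points):
            for H, table in zip(self.hashes, self.tables):
                table[H(p)].append(idx)

    def query(self, q):
        # Union of the buckets q maps to, one bucket per hash table.
        candidates = set()
        for H, table in zip(self.hashes, self.tables):
            candidates.update(table.get(H(q), []))
        if not candidates:
            return None
        # Closest candidate under the Euclidean distance.
        return min(candidates, key=lambda i: math.dist(self.points[i], q))

pts = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.0)]
index = BasicLSHIndex(pts, k=2, M=3, W=4.0)
nearest = index.query((10.0, 10.0))
```

Note that every one of the $M$ tables stores all data points, which is the space cost discussed next, and that a query touches one bucket per table.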

Although Basic LSH yields a significant improvement in the 
running time over both the brute force linear scan and the 
space partitioning approaches [33, 7, 23], unfortunately the
required number of hash functions is usually large [9, 18],
which entails a very large space requirement for the index. 
Also, in the distributed setting, each hash table lookup at 
query time corresponds to a network call which entails a 
large network load. 



Entropy LSH: To mitigate the space inefficiency, Panigrahy
[29] introduced the Entropy LSH scheme. This scheme
uses the same indexing as in the basic LSH scheme, but a different
query search procedure. The idea here is that for each
hash function $H \in \mathcal{H}'$, the data points close to the query
point $q$ are highly likely to hash either to the same value
as $H(q)$ or to a value very close to that. Hence, it makes
sense to also consider as candidates the points mapping to
close hash values. To do so, in this scheme, in addition to
$q$, several "offsets" $q + \delta_i$ ($1 \le i \le L$), chosen randomly from
the surface of $B(q, r)$, the sphere of radius $r$ centered at $q$,
are also hashed and the data points in their hash buckets are
also considered as search result candidates. It is conceivable
that this may reduce the number of required hash tables,
and in fact, Panigrahy [25] shows that with this scheme one
can use as few as $O(1)$ hash tables. The instantiation of his
result for the $\ell_2$ norm is as follows:



Theorem 3. For $n$ data points, choosing $k \ge \frac{\log n}{\log(1/p_2)}$
(with $p_2$ as in Definition 1) and $L = O(n^{2/c})$, as few as
$O(1)$ hash tables suffice to solve the $(c, r)$-NN problem.

Hence, this scheme in fact significantly reduces the number
of required hash tables (from $O(n^{1/c})$ for basic LSH to
$O(1)$), and hence improves the space efficiency of LSH. However, in
the distributed setting, it does not help with reducing the
network load of LSH queries. Actually, since for the basic
LSH, one needs to look up $M = O(n^{1/c})$ buckets but with
this scheme, one needs to look up $L = O(n^{2/c})$ offsets, it
makes the network inefficiency issue even more severe.
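
The Entropy LSH query procedure can be illustrated with a single hash table. The sketch below is an assumption-laden example rather than the scheme's reference implementation: offsets are drawn uniformly from the sphere surface by normalizing a Gaussian vector (a standard technique), and the names `sample_H`, `random_offset`, `build_table`, and `entropy_lsh_query` are hypothetical.

```python
import math
import random
from collections import defaultdict

def sample_H(d, k, W, rng):
    """H(v) = (floor((a_i . v + b_i) / W)) for i = 1..k, with Gaussian a_i."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    b = [rng.uniform(0.0, W) for _ in range(k)]
    return lambda v: tuple(
        math.floor((sum(x * y for x, y in zip(a_i, v)) + b_i) / W)
        for a_i, b_i in zip(a, b)
    )

def random_offset(q, r, rng):
    """Return q + delta with delta uniform on the sphere of radius r."""
    g = [rng.gauss(0.0, 1.0) for _ in q]
    norm = math.sqrt(sum(x * x for x in g))
    return tuple(qi + r * gi / norm for qi, gi in zip(q, g))

def build_table(points, H):
    table = defaultdict(list)
    for p in points:
        table[H(p)].append(p)
    return table

def entropy_lsh_query(q, table, H, r, c, L, rng):
    # Look up the bucket of q and the buckets of L random offsets.
    buckets = {H(q)} | {H(random_offset(q, r, rng)) for _ in range(L)}
    candidates = {p for x in buckets for p in table.get(x, [])}
    # Report the candidates within distance c * r of the query.
    return [p for p in candidates if math.dist(p, q) <= c * r]

rng = random.Random(0)
pts = [(0.0, 0.0), (5.0, 5.0)]
H = sample_H(d=2, k=3, W=4.0, rng=rng)
table = build_table(pts, H)
result = entropy_lsh_query((5.0, 5.0), table, H, r=1.0, c=2.0, L=10, rng=rng)
offset = random_offset((1.0, 2.0, 3.0), 0.5, rng)
```

Only one table is stored, but each query now probes up to $L + 1$ buckets, which is the source of the network cost in the distributed setting.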

3. DISTRIBUTED LSH 

In this section, we will present the Layered LSH scheme and
theoretically analyze it. We will focus on the $d$-dimensional
Euclidean space under the $\ell_2$ norm. As notation, we will let $S$
be a set of $n$ data points available a priori, and $Q$
be the set of query points, either given as a batch (in case
of MapReduce) or arriving in real-time (in case of Active
DHT). Parameters $k$, $L$, $W$ and LSH families $\mathcal{H} = \mathcal{H}_W$ and
$\mathcal{H}' = \mathcal{H}'_W$ will be as defined in Section 2. Since multiple
hash tables can obviously be implemented in parallel, for
the sake of clarity we will focus on a single hash table and
use a randomly chosen hash function $H \in \mathcal{H}'$ as our LSH
function throughout the section.

In (Key, Value) based distributed systems, a hash function 
from the domain of all Keys to the domain of available ma- 
chines is implicitly used to determine the machine responsi- 
ble for each (Key, Value) pair. In this section, for the sake 
of clarity, we will assume this mapping to be simply iden- 
tity. That is, the machine responsible for a (Key, Value) 
data element is simply the machine with id equal to Key. 

At the core, Layered LSH is a carefully distributed implementation
of Entropy LSH. Hence, before presenting it, we
first further detail the simple distributed implementation of
Entropy LSH, described in Section 1.1, and explain its major
drawback. For any data point $p \in S$, a (Key, Value) pair
$(H(p), p)$ is generated and sent to machine $H(p)$. For each
query point $q$, after generating the offsets $q + \delta_i$ ($1 \le i \le L$),
for each unique value $x$ in the set

$$\{H(q + \delta_i) \mid 1 \le i \le L\},$$

a (Key, Value) pair $(x, q)$ is generated and sent to machine
$x$. Hence, machine $x$ will have all the data points $p \in S$
with $H(p) = x$ as well as all query points $q \in Q$ such that
$H(q + \delta_i) = x$ for some $1 \le i \le L$. Then, for any received
query point $q$, this machine retrieves all data points $p$ with
$H(p) = x$ which are within distance $cr$ of $q$, if any such data
points exist. This is done via a UDF in Active DHT or the
Reducer in MapReduce, as presented in Figure 3.1 for the
sake of concreteness of exposition.

In this implementation, the network load due to data points 
is not very significant. Not only is just one (Key, Value) pair
per data point transmitted over the network, but also in
many real-time applications, data indexing is done offline 
when efficiency and speed are not as critical. However, the 
amount of data transmitted per query in this implementa- 
tion is O(Ld): L (Key, Value) pairs, one per offset, each with 
the d-dimensional point q as Value. Both L and d are large 
in many practical applications with high-dimensional data 
(e.g., L can be in the hundreds, and d in the tens or hun- 
dreds). Hence, this implementation needs a lot of network 
communication per query, and with a large batch of queries 
or with queries arriving in real-time at very high rates, this 
will not only put a lot of strain on the valuable and usually 
shared network resources but also significantly slow down 
the search process. 
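
To make the per-query cost of the simple implementation concrete, the following sketch generates the (Key, Value) pairs that a single query would send. It is an illustration under assumed parameter values ($d$, $k$, $W$, $r$, $L$ below are arbitrary), and the helper names `sample_H`, `sphere_offsets`, and `simple_query_messages` are hypothetical.

```python
import math
import random

def sample_H(d, k, W, rng):
    """H(v) = (floor((a_i . v + b_i) / W)) for i = 1..k, with Gaussian a_i."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    b = [rng.uniform(0.0, W) for _ in range(k)]
    return lambda v: tuple(
        math.floor((sum(x * y for x, y in zip(a_i, v)) + b_i) / W)
        for a_i, b_i in zip(a, b)
    )

def sphere_offsets(q, r, L, rng):
    """L offsets q + delta_i, each delta_i uniform on the sphere of radius r."""
    offsets = []
    for _ in range(L):
        g = [rng.gauss(0.0, 1.0) for _ in q]
        norm = math.sqrt(sum(x * x for x in g))
        offsets.append(tuple(qi + r * gi / norm for qi, gi in zip(q, g)))
    return offsets

def simple_query_messages(q, H, r, L, rng):
    """One (Key, Value) pair per unique offset bucket; the Value is q itself."""
    keys = {H(o) for o in sphere_offsets(q, r, L, rng)}
    return [(x, q) for x in keys]

rng = random.Random(0)
d, k, W, r, L = 20, 8, 1.0, 1.0, 200   # illustrative values only
H = sample_H(d, k, W, rng)
q = tuple(rng.uniform(-1.0, 1.0) for _ in range(d))
messages = simple_query_messages(q, H, r, L, rng)
coordinates_sent = len(messages) * d   # at most L * d per query
```

Each message carries the full $d$-dimensional query, so the total payload grows with the number of distinct offset buckets times $d$.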

Therefore, a distributed LSH scheme with significantly bet- 
ter query network efficiency is needed. This is where Layered 
LSH comes into the picture. 

3.1 Layered LSH 

In this subsection, we present the Layered LSH scheme. The 
main idea is to use another layer of locality sensitive hashing 
to distribute the data and query points over the machines. 
More specifically, given a parameter value $D > 0$, we sample
an LSH function $G : \mathbb{R}^k \to \mathbb{Z}$ such that:

$$G(v) = \left\lfloor \frac{\alpha \cdot v + \beta}{D} \right\rfloor \qquad (3.1)$$

where $\alpha \in \mathbb{R}^k$ is a $k$-dimensional vector whose individual
entries are chosen from the standard Gaussian $\mathcal{N}(0, 1)$ distribution,
and $\beta \in \mathbb{R}$ is chosen uniformly from $[0, D]$.

Then, denoting $G(H(\cdot))$ by $GH(\cdot)$, for each data point $p \in S$,
we generate a (Key, Value) pair $(GH(p), \langle H(p), p \rangle)$,
which gets sent to machine $GH(p)$. By breaking down
the Value part into its two pieces, $H(p)$ and $p$, this machine
will then add $p$ to the bucket $H(p)$. This can be done by
the Reducer in MapReduce, and by a UDF in Active DHT.
Similarly, for each query point $q \in Q$, after generating the
offsets $q + \delta_i$ ($1 \le i \le L$), for each unique value $x$ in the set

$$\{GH(q + \delta_i) \mid 1 \le i \le L\} \qquad (3.2)$$

we generate a (Key, Value) pair $(x, q)$ which gets sent to
machine $x$. Then, machine $x$ will have all the data points
$p$ such that $GH(p) = x$ as well as the queries $q \in Q$ one
of whose offsets gets mapped to $x$ by $GH(\cdot)$. Specifically,
if for the offset $q + \delta_i$, we have $GH(q + \delta_i) = x$, all the
data points $p$ with $H(p) = H(q + \delta_i)$ are also located on
machine $x$. Then, this machine regenerates the offsets $q + \delta_i$
($1 \le i \le L$), finds their hash buckets $H(q + \delta_i)$, and for any
of these buckets such that $GH(q + \delta_i) = x$, it performs a
similarity search among the data points in that bucket. Note
that since $q$ is sent to this machine, there exists at least one
such bucket. Also note that the offset regeneration, hash,
and bucket search can all be done by either a UDF in Active
DHT or the Reducer in MapReduce. To make the exposition
more concrete, we have presented the pseudocode for both
the MapReduce and Active DHT implementations of this
scheme in Figure 3.2.



At an intuitive level, the main idea in Layered LSH is that
since $G$ is an LSH, and also for any query point $q$, we have
$H(q + \delta_i) \approx H(q)$ for all offsets $q + \delta_i$ ($1 \le i \le L$), the set in
equation (3.2) has a very small cardinality, which in turn implies
a small amount of network communication per query.
On the other hand, since G and H are both LSH functions, 
if two data points p,p' are far apart, GH(p) and GH(p') are 
highly likely to be different. This means that, while locating 
the nearby points on the same machines, Layered LSH par- 
titions the faraway data points on different machines, which 
in turn ensures a good load balance across the machines. 
Note that this is critical, as without a good load balance, 
the point in distributing the implementation would be lost. 
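
The routing rule of Layered LSH can be sketched by composing the two hash layers and comparing the number of distinct Keys a query produces with and without the second layer. This is only an illustration: the parameter values, in particular the bin size $D$, are arbitrary assumptions, and the helper names (`sample_H`, `sample_G`, `sphere_offsets`, `data_pair`) are hypothetical.

```python
import math
import random

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def sample_H(d, k, W, rng):
    """First layer: H(v) = (floor((a_i . v + b_i) / W)) for i = 1..k."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    b = [rng.uniform(0.0, W) for _ in range(k)]
    return lambda v: tuple(
        math.floor((dot(a_i, v) + b_i) / W) for a_i, b_i in zip(a, b)
    )

def sample_G(k, D, rng):
    """Second layer, as in (3.1): G(v) = floor((alpha . v + beta) / D)."""
    alpha = [rng.gauss(0.0, 1.0) for _ in range(k)]
    beta = rng.uniform(0.0, D)
    return lambda bucket: math.floor((dot(alpha, bucket) + beta) / D)

def sphere_offsets(q, r, L, rng):
    offsets = []
    for _ in range(L):
        g = [rng.gauss(0.0, 1.0) for _ in q]
        norm = math.sqrt(dot(g, g))
        offsets.append(tuple(qi + r * gi / norm for qi, gi in zip(q, g)))
    return offsets

def data_pair(p, H, G):
    """(Key, Value) pair for a data point: (GH(p), <H(p), p>)."""
    bucket = H(p)
    return G(bucket), (bucket, p)

rng = random.Random(0)
d, k, W, D, r, L = 20, 8, 1.0, 16.0, 1.0, 200   # illustrative values only
H, G = sample_H(d, k, W, rng), sample_G(k, D, rng)
q = tuple(rng.uniform(-1.0, 1.0) for _ in range(d))
offsets = sphere_offsets(q, r, L, rng)
simple_keys = {H(o) for o in offsets}       # Keys in the simple implementation
layered_keys = {G(H(o)) for o in offsets}   # Keys in Layered LSH, as in (3.2)
```

Because each layered Key is a function of an $H$ bucket, `layered_keys` can never be larger than `simple_keys`; how much smaller it is depends on $D$, which is the subject of the analysis in the paper.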

In the next section, we present the formal analysis of this 
scheme, and prove that compared to the simple implementa- 
tion, it provides an exponential improvement in the network 
traffic, while maintaining a good load balance across the 
machines. 

3.2 Analysis 

In this section, we analyze the Layered LSH scheme pre- 
sented in the previous section. We first fix some notation. 
As mentioned earlier in the paper, we are interested in the 
$(c, r)$-NN problem. Without loss of generality and to simplify
the notation, in this section we assume $r = 1/c$. This
can be achieved by a simple scaling. The LSH function
$H \in \mathcal{H}'_W$ that we use is $H = (H_1, \ldots, H_k)$, where $k$ is
chosen as in Theorem 3 and for each $1 \le i \le k$:

$$H_i(v) = \left\lfloor \frac{a_i \cdot v + b_i}{W} \right\rfloor$$

where $a_i$ is a $d$-dimensional vector each of whose entries
is chosen from the standard Gaussian $\mathcal{N}(0, 1)$ distribution,
and $b_i \in \mathbb{R}$ is chosen uniformly from $[0, W]$. We will also let
$\Gamma : \mathbb{R}^d \to \mathbb{R}^k$ be $\Gamma = (\Gamma_1, \ldots, \Gamma_k)$, where for $1 \le i \le k$:

$$\Gamma_i(v) = \frac{a_i \cdot v + b_i}{W}$$

hence, $H_i(\cdot) = \lfloor \Gamma_i(\cdot) \rfloor$. We will use the following small
lemma in our analysis:



Lemma 4. For any two vectors $u, v \in \mathbb{R}^d$, we have:

$$\|\Gamma(u) - \Gamma(v)\| - \sqrt{k} \le \|H(u) - H(v)\| \le \|\Gamma(u) - \Gamma(v)\| + \sqrt{k}$$

Proof. Denoting $R_i = \Gamma_i - H_i$ ($1 \le i \le k$) and $R =
(R_1, \ldots, R_k)$, we have $0 \le R_i(u), R_i(v) \le 1$ ($1 \le i \le k$), and
hence $\|R(u) - R(v)\| \le \sqrt{k}$. Also, by definition $H = \Gamma - R$,
and hence $H(u) - H(v) = (\Gamma(u) - \Gamma(v)) + (R(v) - R(u))$.
Then, the result follows from the triangle inequality. □

Algorithm 1 MapReduce Implementation

Map:
    Input: Data set $S$, query set $Q$
    Choose $H$ from $\mathcal{H}'_W$ uniformly at random, but consistently across Mappers
    for each data point $p \in S$ do
        Emit $(H(p), p)$
    end for
    for each query point $q \in Q$ do
        for $1 \le i \le L$ do
            Choose the offset $q + \delta_i$ from the surface of $B(q, r)$
            Emit $(H(q + \delta_i), q)$
        end for
    end for

Reduce:
    Input: For a hash bucket $x$, all data points $p \in S$ with $H(p) = x$, and all query points $q \in Q$ one of whose offsets hashes to $x$.
    for each query point $q$ among the input points do
        for each data point $p$ among the input points do
            if $p$ is within distance $cr$ of $q$ then
                Emit $(q, p)$
            end if
        end for
    end for

Algorithm 2 Active DHT Implementation

Preprocessing:
    Input: Data set $S$
    for each data point $p \in S$ do
        Compute the hash bucket $y = H(p)$
        Send the pair $(y, p)$ to machine with id $y$
        At machine $y$ add $p$ to the in-memory bucket $y$
    end for

Query Time:
    Input: Query point $q \in Q$ arriving in real-time
    for $1 \le i \le L$ do
        Generate the offset $q + \delta_i$
        Compute the hash bucket $x = H(q + \delta_i)$
        Send the pair $(x, q)$ to machine with id $x$
        At machine $x$, run SearchUDF$(x, q)$
    end for

SearchUDF$(x, q)$:
    for each data point $p$ with $H(p) = x$ do
        if $p$ is within distance $cr$ of $q$ then
            Emit $(q, p)$
        end if
    end for

Figure 3.1: Simple Distributed LSH
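
Lemma 4 can also be sanity-checked numerically. The sketch below samples one Gaussian hash function, draws random pairs of vectors, and counts how often the two-sided bound fails (allowing a small floating-point tolerance). The dimensions, bin width, sampling range, and the names `gamma`, `H`, and `lemma4_holds` are assumptions made for this check only.

```python
import math
import random

rng = random.Random(0)
d, k, W = 10, 6, 2.0   # illustrative values only
a = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
b = [rng.uniform(0.0, W) for _ in range(k)]

def gamma(v):
    """Gamma_i(v) = (a_i . v + b_i) / W for i = 1..k (before rounding)."""
    return [(sum(x * y for x, y in zip(a_i, v)) + b_i) / W
            for a_i, b_i in zip(a, b)]

def H(v):
    """H_i(v) = floor(Gamma_i(v))."""
    return [math.floor(t) for t in gamma(v)]

def lemma4_holds(u, v, tol=1e-9):
    g = math.dist(gamma(u), gamma(v))
    h = math.dist(H(u), H(v))
    return g - math.sqrt(k) - tol <= h <= g + math.sqrt(k) + tol

trials = 1000
violations = 0
for _ in range(trials):
    u = [rng.uniform(-5.0, 5.0) for _ in range(d)]
    v = [rng.uniform(-5.0, 5.0) for _ in range(d)]
    if not lemma4_holds(u, v):
        violations += 1
```

Since the lemma is a deterministic statement about rounding, `violations` should stay at zero for any choice of the sampled parameters.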

Our analysis also uses two well-known facts. The first is
the sharp concentration of $\chi^2$-distributed random variables,
which is also used in the proof of the Johnson-Lindenstrauss
lemma [21, 14], and the second is the 2-stability property of
Gaussian distribution:

Fact 5. If $\omega \in \mathbb{R}^m$ is a random $m$-dimensional vector
each of whose entries is chosen from the standard Gaussian
$\mathcal{N}(0, 1)$ distribution, and $m = \Omega(\frac{\log n}{\epsilon^2})$, then with probability
at least $1 - \frac{1}{n^{\Theta(1)}}$, we have

$$(1 - \epsilon)\sqrt{m} \le \|\omega\| \le (1 + \epsilon)\sqrt{m}$$

Fact 6. If $\theta$ is a vector each of whose entries is chosen
from the standard Gaussian $\mathcal{N}(0, 1)$ distribution, then for
any vector $v$ of the same dimension, the random variable
$\theta \cdot v$ has Gaussian $\mathcal{N}(0, \|v\|)$ distribution.

The plan for the analysis is as follows. We will first analyze
(in Theorem 8) the network traffic of Layered LSH and derive
a formula for it based on $D$, the bin size of LSH function
$G$. We will see that as expected, increasing $D$ reduces the
network traffic, and our formula will show the exact relation
between the two. We will next analyze (in Theorem 11)
the load balance of Layered LSH and derive a formula for
it, again based on $D$. Intuitively speaking, a large value of
$D$ tends to put all points on one or few machines, which is
undesirable from the load balance perspective. Our analysis
will formulate this dependence and show its exact form.
These two results together will then show the exact tradeoff
governing the choice of $D$, which we will use to prove
(in Corollary 12) that with an appropriate choice of $D$, Layered
LSH achieves both network efficiency and load balance.
Before proceeding to the analysis, we give a definition:



Definition 7. Having chosen LSH functions G, H, for a
query point $q \in Q$ with offsets $q + \delta_i$ ($1 \le i \le L$), define

$$f_q = |\{GH(q + \delta_i) \mid 1 \le i \le L\}|$$

to be the number of (Key, Value) pairs sent over the network
for query q.

Since q is d-dimensional, the network load due to query q is
$O(d f_q)$. Hence, to analyze the network efficiency of Layered
LSH, it suffices to analyze $f_q$. This is done in the following
theorem:

Theorem 8. For any query point q, with high probability,
that is with probability at least $1 - 1/n^{\Theta(1)}$, we have:

$$f_q = O\left(\frac{k}{D}\right)$$



Algorithm 3 MapReduce Implementation

Map:

Input: Data set S, query set Q

Choose hash functions H, G randomly but consistently
across mappers
for each data point p ∈ S do
    Emit (GH(p), <H(p), p>)
end for
for each query point q ∈ Q do
    for 1 ≤ i ≤ L do
        Generate the offset q + δ_i
        Emit (GH(q + δ_i), q)
    end for
end for

Reduce:

Input: For a hash bucket x, all pairs <H(p), p>
for data points p ∈ S with GH(p) = x, and all query
points q ∈ Q one of whose offsets is mapped to x by
GH.

for each data point p among the input points do
    Add p to bucket H(p)
end for
for each query point q among the input points do
    for 1 ≤ i ≤ L do
        Generate the offset q + δ_i, and find H(q + δ_i)
        if GH(q + δ_i) = x and H(q + δ_i) ≠ H(q + δ_j) (∀ j < i) then
            for each data point p in bucket H(q + δ_i) do
                if p is within distance cr of q then
                    Emit (q, p)
                end if
            end for
        end if
    end for
end for

Algorithm 4 Active DHT Implementation

Preprocessing:

Input: Data set S

for each data point p ∈ S do
    Compute the hash bucket H(p) and machine id y = GH(p)
    Send the pair (y, <H(p), p>) to machine with id y
    At machine y, add p to the in-memory bucket H(p)
end for

Query Time:

Input: Query point q ∈ Q arriving in real-time

for 1 ≤ i ≤ L do
    Generate the offset q + δ_i, compute x = GH(q + δ_i)
    if GH(q + δ_j) ≠ x (∀ j < i) then
        Send the pair (x, q) to machine with id x
        At machine x, run SearchUDF(x, q)
    end if
end for

SearchUDF(x, q):

for 1 ≤ i ≤ L do
    Generate the offset q + δ_i, compute H(q + δ_i), GH(q + δ_i)
    if GH(q + δ_i) = x and H(q + δ_i) ≠ H(q + δ_j) (∀ j < i) then
        for each data point p in bucket H(q + δ_i) do
            if p is within distance cr of q then
                Emit (q, p)
            end if
        end for
    end if
end for



Figure 3.2: Layered LSH 
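To make the routing step of Figure 3.2 concrete, the following Python sketch counts, for a single query, how many distinct H buckets its offsets touch and how many distinct machine ids GH they map to (the quantity f_q of Definition 7). It is a minimal sketch under assumptions: H and G are taken to be floor-of-Gaussian-projection hashes with bin sizes W and D, offsets are sampled uniformly from a sphere of a caller-chosen radius, and all function names are hypothetical.

```python
import math
import random

def make_H(d, k, W, rng):
    """Assumed first layer: H_t(v) = floor((a_t . v + b_t) / W)."""
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    b = [rng.uniform(0.0, W) for _ in range(k)]
    return lambda v: tuple(
        math.floor((sum(x * y for x, y in zip(a, v)) + bt) / W)
        for a, bt in zip(A, b))

def make_G(k, D, rng):
    """Assumed second layer: G(h) = floor((alpha . h + beta) / D)."""
    alpha = [rng.gauss(0.0, 1.0) for _ in range(k)]
    beta = rng.uniform(0.0, D)
    return lambda h: math.floor(
        (sum(x * y for x, y in zip(alpha, h)) + beta) / D)

def random_offset(d, radius, rng):
    """Uniform point on the sphere of the given radius around the origin."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [radius * x / norm for x in g]

def network_calls(q, L, H, G, radius, rng):
    """Return (#distinct H buckets, #distinct GH machine ids) over L offsets."""
    buckets, machines = set(), set()
    for _ in range(L):
        delta = random_offset(len(q), radius, rng)
        h = H([qi + di for qi, di in zip(q, delta)])
        buckets.add(h)          # what Simple LSH would address separately
        machines.add(G(h))      # what Layered LSH actually sends to
    return len(buckets), len(machines)
```

Because GH is a function of H, the second count can never exceed the first; how much smaller it is depends on D, which is what the analysis that follows quantifies.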



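Proof. Since for any offset $q + \delta_i$, the value $GH(q + \delta_i)$
is an integer, we have:

$$f_q \le \max_{1 \le i,j \le L} \{GH(q + \delta_i) - GH(q + \delta_j)\} \qquad (3.3)$$

For any vector v, we have

$$\frac{\alpha \cdot v + \beta}{D} - 1 \le G(v) \le \frac{\alpha \cdot v + \beta}{D}$$

hence for any $1 \le i, j \le L$:

$$GH(q + \delta_i) - GH(q + \delta_j) \le \frac{\alpha \cdot (H(q + \delta_i) - H(q + \delta_j))}{D} + 1$$

Thus, from equation (3.3) we get:

$$f_q \le \frac{1}{D} \max_{1 \le i,j \le L} \{\alpha \cdot (H(q + \delta_i) - H(q + \delta_j))\} + 1$$

From the Cauchy-Schwarz inequality for inner products, we
have for any $1 \le i, j \le L$:

$$\alpha \cdot (H(q + \delta_i) - H(q + \delta_j)) \le \|\alpha\| \cdot \|H(q + \delta_i) - H(q + \delta_j)\|$$

Hence, we get:

$$f_q \le \frac{\|\alpha\|}{D} \max_{1 \le i,j \le L} \{\|H(q + \delta_i) - H(q + \delta_j)\|\} + 1 \qquad (3.4)$$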
For any $1 \le i, j \le L$, we know from Lemma 4:

$$\|H(q + \delta_i) - H(q + \delta_j)\| \le \|\Gamma(q + \delta_i) - \Gamma(q + \delta_j)\| + \sqrt{k} \qquad (3.5)$$

Furthermore, for any $1 \le t \le k$, since

$$\Gamma_t(q + \delta_i) - \Gamma_t(q + \delta_j) = \frac{a_t \cdot (\delta_i - \delta_j)}{W}$$

we know, using Fact 6, that $\Gamma_t(q + \delta_i) - \Gamma_t(q + \delta_j)$ is
distributed as Gaussian $\mathcal{N}(0, \|\delta_i - \delta_j\| / W)$. Now, recall from
Theorem 3 that for our LSH function, $k \ge \frac{\log n}{\log(1/p_2)}$. Hence, there
is a constant $\epsilon = \epsilon(p_2) < 1$ for which we have, using Fact 5:

$$\|\Gamma(q + \delta_i) - \Gamma(q + \delta_j)\| \le (1 + \epsilon)\sqrt{k} \, \frac{\|\delta_i - \delta_j\|}{W}$$

with probability at least $1 - 1/n^{\Theta(1)}$. Since, as explained in
Section 2, all offsets are chosen from the surface of the sphere
B(q, 1/c) of radius 1/c centered at q, we have $\|\delta_i - \delta_j\| \le 2/c$.
Hence overall, for any $1 \le i, j \le L$:

$$\|\Gamma(q + \delta_i) - \Gamma(q + \delta_j)\| \le 2(1 + \epsilon)\frac{\sqrt{k}}{cW} \le 4\frac{\sqrt{k}}{cW}$$

with high probability. Then, since there are only $L^2$ different
choices of i, j, and L is only polynomially large in n, we get:

$$\max_{1 \le i,j \le L} \{\|\Gamma(q + \delta_i) - \Gamma(q + \delta_j)\|\} \le 4\frac{\sqrt{k}}{cW}$$

with high probability. Then, using equation (3.5), we get with
high probability:

$$\max_{1 \le i,j \le L} \{\|H(q + \delta_i) - H(q + \delta_j)\|\} \le \left(1 + \frac{4}{cW}\right)\sqrt{k} \qquad (3.6)$$

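Furthermore, since each entry of $\alpha \in \mathbb{R}^k$ is distributed as
$\mathcal{N}(0, 1)$, another application of Fact 5 gives (again with
$\epsilon = \epsilon(p_2)$):

$$\|\alpha\| \le (1 + \epsilon)\sqrt{k} \le 2\sqrt{k} \qquad (3.7)$$

with high probability. Then, equations (3.4), (3.6) and (3.7)
together give:

$$f_q \le \frac{2k}{D}\left(1 + \frac{4}{cW}\right) + 1 = O\left(\frac{k}{D}\right)$$

which finishes the proof. □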



Remark 9. A surprising property of Layered LSH demonstrated
by Theorem 8 is that the network load is independent
of the number of query offsets, L. Note that with Entropy
LSH, to increase the search quality one needs to increase 
the number of offsets, which will then directly increase the 
network load. Similarly, with basic LSH, to increase the 
search quality one needs to increase the number of hash ta- 
bles, which again directly increases the network load. How- 
ever, with Layered LSH the network efficiency is achieved 
independently of the level of search quality. Hence, search 
quality can be increased without any effect on the network 
load! 



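Next, we proceed to analyzing the load balance of Layered
LSH. First, recalling the classic definition of the error function:

$$\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-\tau^2} \, d\tau$$

we define the function $P(\cdot)$:

$$P(z) = \operatorname{erf}(z) - \frac{1}{\sqrt{\pi}\, z}\left(1 - e^{-z^2}\right) \qquad (3.8)$$

and prove the following lemma: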



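Lemma 10. For any two points $u, v \in \mathbb{R}^k$ with $\|u - v\| = \lambda$,
we have:

$$\Pr[G(u) = G(v)] = P\left(\frac{D}{\sqrt{2}\lambda}\right)$$

Proof. Since $\beta$ is uniformly distributed over [0, D], we
have:

$$\Pr[G(u) = G(v) \mid \alpha \cdot (u - v) = l] = \max\left\{0, 1 - \frac{|l|}{D}\right\}$$

Then, since by Fact 6 $\alpha \cdot (u - v)$ is distributed as Gaussian
$\mathcal{N}(0, \lambda)$, we have:

$$
\begin{aligned}
\Pr[G(u) = G(v)] &= \int_{-D}^{D} \left(1 - \frac{|l|}{D}\right) \frac{1}{\lambda\sqrt{2\pi}} e^{-\frac{l^2}{2\lambda^2}} \, dl \\
&= 2 \int_0^D \left(1 - \frac{l}{D}\right) \frac{1}{\lambda\sqrt{2\pi}} e^{-\frac{l^2}{2\lambda^2}} \, dl \\
&= \int_0^D \frac{\sqrt{2}}{\lambda\sqrt{\pi}} e^{-\frac{l^2}{2\lambda^2}} \, dl - \int_0^D \frac{l\sqrt{2}}{D\lambda\sqrt{\pi}} e^{-\frac{l^2}{2\lambda^2}} \, dl \\
&= \operatorname{erf}\left(\frac{D}{\sqrt{2}\lambda}\right) - \sqrt{\frac{2}{\pi}} \, \frac{\lambda}{D} \left(1 - e^{-\frac{D^2}{2\lambda^2}}\right) \\
&= P\left(\frac{D}{\sqrt{2}\lambda}\right)
\end{aligned}
$$

□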
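Lemma 10 can be checked numerically. The sketch below evaluates P from equation (3.8) and compares it against a Monte Carlo estimate of the collision probability of G. It assumes, as in the proof, that the projected difference of the two points is Gaussian with standard deviation λ and that the shift β is uniform on [0, D]; the function names are illustrative only.

```python
import math
import random

def P(z):
    """P(z) = erf(z) - (1 - exp(-z^2)) / (sqrt(pi) * z), for z > 0."""
    return math.erf(z) - (1.0 - math.exp(-z * z)) / (math.sqrt(math.pi) * z)

def collision_rate(lam, D, trials, rng):
    """Empirical Pr[G(u) = G(v)] when alpha.(u - v) ~ N(0, lam)."""
    hits = 0
    for _ in range(trials):
        l = rng.gauss(0.0, lam)       # projected difference alpha.(u - v)
        beta = rng.uniform(0.0, D)    # random shift used by G
        # Without loss of generality alpha.v = 0, so
        # G(v) = floor(beta / D) and G(u) = floor((l + beta) / D).
        if math.floor((l + beta) / D) == math.floor(beta / D):
            hits += 1
    return hits / trials
```

For example, `collision_rate(1.0, 2.0, 100000, random.Random(0))` should land close to `P(2.0 / math.sqrt(2.0))`, up to sampling noise.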



One can easily see that $P(\cdot)$ is a monotonically increasing
function, and for any $0 < \xi < 1$ there exists a number $z = z_\xi$
such that $P(z_\xi) = \xi$. Using this notation and the previous
lemma, we prove the following theorem:



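Theorem 11. For any constant $0 < \xi < 1$, there is a $\lambda_\xi$
such that

$$\lambda_\xi / W = O\left(1 + \frac{D}{\sqrt{k}}\right)$$

and for any two points u, v with $\|u - v\| \ge \lambda_\xi$, we have:

$$\Pr[GH(u) = GH(v)] \le \xi + o(1)$$

where o(1) is polynomially small in n.

Proof. Let $u, v \in \mathbb{R}^d$ be two points and denote $\|u - v\| = \lambda$.
Then by Lemma 4 we have:

$$\|H(u) - H(v)\| \ge \|\Gamma(u) - \Gamma(v)\| - \sqrt{k}$$

As in the proof of Theorem 8, one can see, using $k \ge \frac{\log n}{\log(1/p_2)}$
(from Theorem 3) and Fact 5, that there exists an $\epsilon = \epsilon(p_2) = O(1)$
such that with probability at least $1 - 1/n^{\Theta(1)}$ we have:

$$\|\Gamma(u) - \Gamma(v)\| \ge (1 - \epsilon)\frac{\lambda\sqrt{k}}{W}$$

Hence, with probability at least $1 - 1/n^{\Theta(1)}$ we have:

$$\|H(u) - H(v)\| \ge \lambda' = \left((1 - \epsilon)\frac{\lambda}{W} - 1\right)\sqrt{k}$$

Now, letting:

$$\lambda_\xi = \frac{W}{1 - \epsilon}\left(1 + \frac{D}{z_\xi\sqrt{2k}}\right)$$

we have that if $\lambda \ge \lambda_\xi$ then $\lambda' \ge \frac{D}{z_\xi\sqrt{2}}$, and hence by Lemma 10:

$$\Pr[GH(u) = GH(v) \mid \|H(u) - H(v)\| \ge \lambda'] \le P\left(\frac{D}{\sqrt{2}\lambda'}\right) \le \xi$$

which finishes the proof by recalling $\Pr[\|H(u) - H(v)\| < \lambda'] = o(1)$. □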
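The threshold used in the proof of Theorem 11 can be evaluated numerically. The sketch below inverts P by bisection (valid because P is monotonically increasing) to obtain z_ξ and then computes λ_ξ = (W / (1 − ε)) (1 + D / (z_ξ √(2k))). The value of ε is not specified beyond being a constant, so it is passed in as a parameter, and the function names are hypothetical.

```python
import math

def P(z):
    """P(z) from equation (3.8); expm1 keeps small z numerically stable."""
    return math.erf(z) + math.expm1(-z * z) / (math.sqrt(math.pi) * z)

def z_xi(xi, lo=1e-9, hi=1e6, iters=200):
    """Solve P(z) = xi for 0 < xi < 1 by bisection on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if P(mid) < xi:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lambda_xi(xi, W, D, k, eps):
    """Distance beyond which two points share a machine w.p. at most ~xi."""
    return W / (1.0 - eps) * (1.0 + D / (z_xi(xi) * math.sqrt(2.0 * k)))
```

With W, k and ε held fixed, the returned threshold grows linearly in D, which is the load-balance side of the tradeoff discussed next.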



Theorems 8 and 11 show the tradeoff governing the choice of
parameter D. Increasing D reduces network traffic at the cost
of a more skewed load distribution. We need to choose D such
that the load is balanced yet the network traffic is low.
Theorem 11 shows that choosing $D = o(\sqrt{k})$ does not asymptotically
help with the distance threshold at which points
become likely to be sent to different machines. On the other
hand, Theorem 11 also shows that choosing $D = \omega(\sqrt{k})$ is
undesirable, as it unnecessarily skews the load distribution.
To observe this more clearly, recall that intuitively speaking,
the goal in Layered LSH is that if two data points $p_1, p_2$ hash
to the same values as two of the offsets $q + \delta_i, q + \delta_j$ (for some
$1 \le i, j \le L$) of a query point q (i.e., $H(p_1) = H(q + \delta_i)$
and $H(p_2) = H(q + \delta_j)$), then $p_1, p_2$ are likely to be sent to
the same machine. Since H has a bin size of W, such a pair
of points $p_1, p_2$ most likely has distance $O(W)$. Hence,
D should be only large enough to make points which are
$O(W)$ away likely to be sent to the same machine. Theorem
11 shows that to do so, we need to choose D such that:

$$O\left(1 + \frac{D}{\sqrt{k}}\right) = O(1)$$

that is, $D = O(\sqrt{k})$. Then, by Theorem 8, to minimize
the network traffic, we choose $D = \Theta(\sqrt{k})$, and get
$f_q = O(\sqrt{k}) = O(\sqrt{\log n})$. This is summarized in the following
corollary:

Corollary 12. Choosing $D = \Theta(\sqrt{k})$, Layered LSH guarantees
that the number of (Key, Value) pairs sent over the
network per query is $O(\sqrt{\log n})$ with high probability, and
yet points which are $\Omega(W)$ away get sent to different machines
with constant probability.
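As a worked illustration of the network side of this tradeoff, plugging (3.6) and (3.7) into (3.4) gives the explicit high-probability bound f_q ≤ (2k/D)(1 + 4/(cW)) + 1. The short Python sketch below simply evaluates that expression; the function name is hypothetical, and the constants are those of the worst-case proof, so the values should be read as upper bounds rather than as predictions of measured traffic.

```python
import math

def fq_upper_bound(k, D, c, W):
    """Bound obtained by combining (3.4), (3.6) and (3.7)."""
    return (2.0 * k / D) * (1.0 + 4.0 / (c * W)) + 1.0

# Example with k = 10, W = 0.5, c = 2 and D = sqrt(k), as in Corollary 12.
bound = fq_upper_bound(10, math.sqrt(10), 2, 0.5)
```

The expression does not involve L at all, and doubling D halves the leading term, which matches the statement that a larger D reduces network traffic.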

Remark 13. Corollary 12 shows that, compared to the
simple distributed implementation of Entropy LSH and basic
LSH, Layered LSH exponentially improves the network load,
from $O(n^{\Theta(1)})$ to $O(\sqrt{\log n})$, while maintaining the load
balance across the different machines.

4. EXPERIMENTS 

In this section, we first present an experimental comparison of
Simple and Layered LSH via the MapReduce framework
with respect to the network cost (shuffle size) and "wall-clock"
run time for a number of data sets. Second, we
compare Layered LSH (Section 3) with the Sum and Cauchy
distributed LSH schemes described in Haghani et al. [20].
Finally, we also analyze the results by considering the load
balance properties.

4.1 Datasets 

First, we describe the data sets we used. 

• Random: This data set is constructed by sampling
points from $N^d(0, 1)$¹ with d = 100, and the queries are
generated by adding a small perturbation drawn from
$N^d(0, r)$ to a randomly chosen data point, where r =
0.3. We use 1M data points and 100K queries. This
"planted" data set has been used for LSH experiments
in [15], and we solve the (c, r)-NN problem on it with
c = 2. The parameter choice is such that for each
query point, the expected distance to its closest data
point is r and that with high probability only that data
point is within distance cr from it.

1 $N^d(0, r)$ denotes the normal distribution around the origin,
$0 \in \mathbb{R}^d$, where the i-th coordinate of a randomly chosen
point has the distribution $N(0, r/\sqrt{d})$, $\forall i \in 1 \ldots d$.

• Wiki: We use the English Wikipedia corpus² from
February 2012 to compute TF-IDF vectors for each
document in it after removing stop words, stemming,
and removing insignificant words (appearing fewer than
20 times in the corpus). We partition the 3.75M articles
in the corpus randomly into a data set of size 3M
and a query set of size 750K. We solve the (c, r)-NN
problem with r = 0.1 and c = 2.

• Image [2]: The Tiny Image data set consists of almost
80M "tiny" images of size 32 × 32 pixels [2]. We
extract a 64-dimensional color histogram from each image
in this data set using the extractcolorhistogram tool
in the FIRE image search engine, as described in [27, 17],
and normalize it to unit norm in the preprocessing
step. 1M data points and 200K queries are sampled
randomly, ensuring no overlap. The average distance of
a query to its closest data point is estimated, through
sampling, to be 0.08 (with standard deviation 0.07),
and hence we solve the (c, r)-NN problem on this data
set with (r = 0.08, c = 2).
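The construction of the Random ("planted") data set described above can be sketched in Python as follows. This is a small-scale illustration, not the generator used in the experiments: it assumes the second argument of N(0, r/√d) is a standard deviation, uses the standard library instead of a numerical package, and the names `sample_nd` and `planted_dataset` are hypothetical.

```python
import math
import random

def sample_nd(d, r, rng):
    """Draw from N^d(0, r): each coordinate is N(0, r / sqrt(d))."""
    sigma = r / math.sqrt(d)   # assumed to be a standard deviation
    return [rng.gauss(0.0, sigma) for _ in range(d)]

def planted_dataset(n_points, n_queries, d=100, r=0.3, seed=0):
    """Data points from N^d(0, 1); each query is a perturbed data point."""
    rng = random.Random(seed)
    data = [sample_nd(d, 1.0, rng) for _ in range(n_points)]
    queries = []
    for _ in range(n_queries):
        base = rng.choice(data)              # the planted near neighbor
        noise = sample_nd(d, r, rng)         # perturbation of norm about r
        queries.append([b + e for b, e in zip(base, noise)])
    return data, queries
```

Under this reading, a perturbation has norm close to r while two independent data points are roughly √2 apart, so each query's planted point is typically its only neighbor within distance cr for c = 2.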

4.2 Implementation Details 

We perform experiments on a small cluster of 13 compute
nodes using Hadoop [1] with 800MB JVMs to implement
Map and Reduce tasks. Consistency in the choice of hash
functions H, G (Section 3) as well as offsets across mappers
and reducers is ensured by setting the seed of the random
number generator appropriately.

We choose the LSH parameters (W = 0.5, k = 10) for the
Random data set, (W = 0.3, k = 16) for the Image data
set, and (W = 0.5, k = 12) for the Wiki data set according
to the calculations in [25] and the experiments in [27, 31]. We
optimized D, the parameter of Layered LSH, using a simple
binary search to minimize the wall-clock run time.

Since the underlying dimensionality (vocabulary size 549532)
for the Wiki data set is large, we use Multi-Probe LSH
(MPLSH) [27] as our first layer of hashing for that data set.
We discuss MPLSH in more detail in the Related Work section.
We measure the accuracy of search results by computing the
recall, i.e., the fraction of query points for which at least one
data point within a distance r is returned in the output.
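The recall measure can be written down in a few lines. The sketch below follows one reading of the definition above, namely that a query counts as answered when its output contains at least one data point within distance r, and that the denominator is the full set of queries; the function name and argument layout are assumptions made for illustration.

```python
import math

def recall(queries, results, r):
    """Fraction of queries whose output has a data point within distance r.

    queries: list of points; results: list (same order) of returned points.
    """
    hit = sum(1 for q, out in zip(queries, results)
              if any(math.dist(q, p) <= r for p in out))
    return hit / len(queries)
```

For instance, with two queries where only the first has a returned point within distance r, the function returns 0.5.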

4.3 Results 

1. Comparison with Simple LSH: Figure 4.1 describes
the results of scaling L, the number of offsets per query,
on the recall, shuffle size and wall-clock run time using
a single hash table. Note that the recall of Entropy
LSH can be improved by using O(1) hash tables. Since
improving recall is not the main aspect of this paper,
we use just a single hash table for all our experiments.
We observe that even with a crude binary search for D,
on average Layered LSH provides a factor 3 improvement
over Simple LSH in the wall-clock run time on
account of a factor 10 or more decrease in the shuffle
size. Further, improving recall by increasing L results
in a linear increase in the shuffle size for Simple LSH,
while the shuffle size for Layered LSH remains almost
constant. This observation verifies Theorem 8 and Remark 9.
Note that since Hadoop offers checkpointing
guarantees where (Key, Value) pairs output by mappers
and reducers may be written to the disk, Layered
LSH also decreases the amount of data written to the
distributed file system.

2 http://download.wikimedia.org

Figure 4.1: Variation in recall, shuffle size and wall-clock run time with L for Random, Wiki and Image data sets.

Figure 4.2: Variation in wall-clock run time for the "Sum",
"Cauchy" and Layered LSH schemes with increasing L

2. Comparison with the "Sum" and "Cauchy" distributed
LSH schemes described in [20]: We
compare Layered LSH with the Sum and Cauchy distributed
LSH schemes described in Haghani et al. [20].
Figure 4.2 shows that Layered LSH compares favorably
with the Cauchy scheme³ for the Wiki data set. The
MapReduce job for Sum failed due to a reduce task running
out of memory, indicating load imbalance.⁴ This
can also be seen as a manifestation of "the curse of the
last reducer" [32].



3 associated parameter chosen via a crude binary search to 
minimize runtime 

4 Recall that reduce tasks store data points in memory 





Scheme         Average    Max
Simple LSH     3K         9K
Sum            3K         214K
Cauchy         3K         45K
Layered LSH    3K         103K



Table 1: Wiki data set: distribution of data points across 
1024 reduce tasks 



Load Balance: Next, we discuss the distribution (average 
and max) of data points in the Wiki data set to 1024 reduce 
tasks for the different distribution schemes. 

First, Table 1 above demonstrates that Sum has the most
imbalanced load distribution, explaining its failure on MapReduce.
Second, Simple LSH, while having the best load balance
for the Wiki data set, incurs a large network cost in
order to achieve this load balance. In contrast, Layered
LSH offers a tunable way of trading off load balance to
decrease network cost and minimize the wall-clock run time.
Although Cauchy compares favorably to Layered LSH in
load balance, it is worse off in running time. In addition, it
is not clear whether it is possible to provide any theoretical
guarantees for the Cauchy scheme.
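The skew visible in Table 1 can be summarized as a max-to-average ratio per scheme. The Python snippet below only performs that arithmetic on the table entries (in thousands of data points); since the averages are rounded to 3K in the table, the ratios are approximate, and the helper name `skew` is illustrative.

```python
# (average, max) data points per reduce task, in thousands, from Table 1
table = {
    "Simple LSH": (3, 9),
    "Sum": (3, 214),
    "Cauchy": (3, 45),
    "Layered LSH": (3, 103),
}

def skew(avg_k, max_k):
    """Max-to-average load ratio; 1.0 would be a perfectly even split."""
    return max_k / avg_k

ratios = {name: skew(*row) for name, row in table.items()}

# Consistency check on the rounded average: 3M Wiki data points
# spread over 1024 reduce tasks.
exact_avg = 3_000_000 / 1024
```

The ordering of the ratios (Simple LSH lowest, then Cauchy, Layered LSH, and Sum highest) is the same ordering of load balance that the paragraph above describes.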

5. RELATED WORK 

Locality Sensitive Hashing (LSH) was introduced by Indyk
and Motwani in order to solve high-dimensional similarity
search problems [21]. LSH indexing methods are based on
LSH families of hash functions for which near points have a
higher likelihood of hashing to the same value. The (c, r)-NN
problem can then be solved by using multiple hash tables.
Gionis et al. [18] showed that in the Euclidean space $O(n^{1/c})$
hash tables suffice, which was later improved, by Datar et al.
[15], to $O(n^{\beta/c})$ (for some $\beta < 1$), and further, by Andoni
and Indyk [3], to $O(n^{1/c^2})$, which almost matches the lower
bound proved by Motwani et al. [28]. LSH families are also
known for several non-Euclidean metrics, such as Jaccard
distance [8] and cosine similarity [10].

The main problem with LSH indexing is that to guarantee
a good search quality, it requires a large number of hash tables.
This entails a large index space requirement, and in the
distributed setting, also a large amount of network communication
per query. To mitigate the space inefficiency, Panigrahy
[29] proposed Entropy LSH which, by also looking up
the hash buckets of $O(n^{2/c})$ random query "offsets", requires
just $O(1)$ hash tables, and hence provides a large space
improvement. But Entropy LSH does not help with, and in
fact worsens, the network inefficiency of conventional LSH:
each query, instead of $O(n^{1/c})$ network calls, one per hash
table, requires $O(n^{2/c})$ calls, one per offset. Our Layered
LSH scheme exponentially improves this and, while guaranteeing
a good load balance, requires only $O(\sqrt{\log n})$ network
calls per query.

To reduce the number of offsets required by Entropy LSH,
Lv et al. [27] proposed the Multi-Probe LSH (MPLSH)
heuristic, in which a query-directed probing sequence is used
instead of random offsets. They experimentally show that this
heuristic reduces the number of required offset lookups. In
a distributed setting, this translates to a smaller number
of network calls per query, and Layered LSH can be implemented
by using MPLSH instead of Entropy LSH as the first
"layer" of hashing, as demonstrated by experiments on the
Wiki data set in Section 4. Hence, the benefits of the two
methods can be combined in practice.

Haghani et al. [20] describe the Sum and Cauchy schemes,
which map LSH buckets to peers in p2p networks in order
to minimize network costs. However, in contrast to Layered
LSH, no guarantees on network cost and load balance
are provided. In this paper, we show via MapReduce experiments
on the Wiki data set that Sum distributes data
unevenly and thus may overload some of the reduce tasks. In
addition, we also describe experiments which demonstrate that
Layered LSH compares favorably with Cauchy on this data
set.

6. CONCLUSIONS 

We presented and analyzed Layered LSH, an efficient dis- 
tributed implementation of LSH similarity search indexing. 
We proved that, compared to the straightforward distributed 
implementation of LSH, Layered LSH exponentially improves 
the network load, while maintaining a good load balance. 
Our analysis also showed that, surprisingly, the network load 
of Layered LSH is independent of the search quality. Our ex- 
periments confirmed that Layered LSH results in significant 
network load reductions as well as runtime speedups. 